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, Abstract 

The estimation of a density profile from experimental data points is 
a challenging problem, usually tackled by plotting a histogram. Prior 
assumptions on the nature of the density, from its smoothness to the 
specification of its form, allow the design of more accurate estimation 
procedures, such as Maximum Likelihood. Our aim is to construct a 
procedure that makes no explicit assumptions, but still providing an ac- 
curate estimate of the density. We introduce the self-consistent estimate: 
IJ^ ■ the power spectrum of a candidate density is given, and an estimation 

procedure is constructed on the assumption, to be released a posteriori, 
that the candidate is correct. The self-consistent estimate is defined as a 
prior candidate density that precisely reproduces itself. Our main result 
is to derive the exact expression of the self-consistent estimate for any 
, ■ given dataset, and to study its properties. Applications of the method re- 

quire neither priors on the form of the density nor the subjective choice of 
If^ , parameters. A cutoff frequency, akin to a bin size or a kernel bandwidth, 

^ ' emerges naturally from the derivation. We apply the self-consistent esti- 

I mate to artificial data generated from various distributions and show that 

■ it reaches the theoretical limit for the scaling of the square error with the 

dataset size. 

cn 

00 

O ' 1 Introduction 

■ 

' Every scientist has encountered the problem of estimating a continuous density 

from a discrete set of data points. This may happen, for example , when deter- 
mining a probability distribution from a finite Monte Carlo sample ( Binder. 19861) 



r> I rounding off the shape of a galaxy from a collection of stars ( Ripley and Sutherland. 1990l ) , 



' or assessing th e instantaneous fir ing rate of a neuron from a discrete set of ac- 



tion potentials (jKass et al.. 20051) . In all those cases, one can adopt two different 
approaches: either assuming a given functional form for the density a priori, 
specified by a certain number of parameters, or renouncing any prior knowledge 
(beyond that a density exists and, in some cases, that is smooth). These two 
approaches lead, respectively, to parametric and non-parametric estimates. We 
will focus on the latter approach, although we will assume the knowledge of the 
density as a reasoning tool, to be released a posteriori. 



*Department of Neurobiology, Yale University, 333 Cedar Street, SHM-C400D, New Haven, 
Connecticut, alberto.bernacchia@yale.edu 

tThe Niels Bohr International Academy, The Niels Bohr Institute, Blegdamsvej 17, DK- 
2100 Copenhagen, Denmark. 



1 



The most popular non-parametric method is simply plotting a histogram, 

but more sophisticated procedures h ave been developed. Kernel Density Estima- 

tion (KDE) has been widely studied ( Silverman. 19861 Parzen. 1961 : Wand and Jones. 19951) : 



instead of counting the number of points in separate bins, KDE constructs a 
smoothed picture of the data as a superposition of kernel functions centered 
at the coordinates of data points. More formally, given a sample of N data 
points (real numbers), denoted by {Xj} [j — 1, A^), the KDE estimate f{x) 
is written as 

1 V-.-r?!^ 



= (1) 

where K{x) is the smoothin g kernel and h is the bandwidth. Usually, the 



choice of K{x) is not crucial (jSilverman. 19861 ). while ft., which controls the de- 



gree of smoothing, has to be carefully adjusted: the more concentrated data 
points are, the less smoothing is necessary in order to obtain a good esti- 
mate of their density. An altern ative non-parametric meth od is the Maxi- 



mum Penalized Likelihood (MPL, iGood and Gaskins fl971)), also known in 



the physics literature as a regularization of Field Theory (jBialek et al. . 19961: 



Holv. 19871: ISchmidt. 20nnll : it consists m performing a functional average of 



densities weighted by their likelihood and by a measure of their smoothness. 

In general, each non-parametric method depends on the arbitrary choice of 
an adjustable parameter, such as the bin size in histograms, the bandwidth in 
KDE, or the cutoff frequency in MPL and Field Theory. Each of them regular- 
izes the estimate and avoids overfitting of the data points. In most cases this 
corresponds to low-pass filtering, i.e. cutting the high frequencies inherent to the 
discrete dataset, and preventing the estimate from merely reproducing a nar- 
row peak at each data point. However, it would be desirable to devise methods 
involving the least possible number of parameters, since their determination usu- 
ally involves some specific assumptions on the distribution to be estimated (e.g. 
varying the cutoff parameter in MPL a nd Field Theory pr ecisely corresponds 



to different choices of the bayesian prior (jBialek et al.. 1996|)). Cross- validation 
techniques have been previously applied for this purpose (jBowman. 1984 ). but 
they are computationally expensive and have been seldom applied in the liter- 
ature. 

In this study we show that a self-consistent approach leads to the emergence 
of a natural cutoff frequency, and an estimate of the density whose performance 
approaches the theoretical limit for the scaling of the square error with the 



dataset size. We start from the observation, made in (jWatson and Leadbetter. 1963f ). 
that a unique " optimal" convolution kernel can be derived as a function of the 
power spectrum of the (unknown) density to be estimated. This result alone is 
of little use, since the power spectrum of the true density is not known a priori. 
However, in Section 2 we exploit the result by defining the "self-consistent" esti- 
mate as the one whose associated optimal kernel, applied to the sample dataset, 
returns the estimate itself. Our main result is to derive the exact expression 
of the self-consistent estimate for any given dataset, and to study its proper- 
ties. In Section 3 we test the method on three different problems: the estimates 
of Gaussian, Cauchy and Comb distribution. In all cases we show that the 
self-consistent estimate outperforms existing methods and the mean integrated 
square error reaches the optimal theoretical scaling ~ N~^. Technical material 
is presented in the appendices: in Appendix 1 we replicate and extend the re- 
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suits on the existence of the optimal kernel. In Appendix 2 we provide details 
of the derivation of the self-consistent estimate. In Appendix 3 we prove that 
the self-consistent estimate converge almost surely to the true distribution for 
large N. 



2 The self-consistent estimate 



In this section, we define the self-consistent estimate, we derive its exact expres- 



sion, and we study its properties. We start from a result derived bv lWatson and Leadbetter (1963h . 
that we replicate and extend in Appendix 1. The basic result is that a unique, 
optimal convolution kernel can be derived as a function of the power spectrum 
of the density to be estimated, where "optimal" is intended as minimizing the 
mean integrated square difference between the true density and its estimate. 

Given a sample of N data points (real numbers), denoted by {Xj} (j = 
1 . . . iV), each independently drawn from a probability density distribution f{x), 
we write the estimate as 



1 ^ 

/(.T) = -5]ir(x-X,) (2) 



where we assume f,fGL'^. Note that © does not depend on any bandwidth 
h, contrary to the KDE estimate ([Ij. Instead of choosing an arbitrary shap e for 



the kernel K, and looking for an optimal bandwidth (see e.g. [Silverman 
we rather look for an optimal shape of the kernel. It turns out that the Fourier 
transform Knpt(t) of the optim al kernel Kopt{x) is equal to (see Appendix 1 and 



Watson and Leadbetter (1963f) ) 

N_ 

N-i + m)\ 

where (j){t) is the Fourier transform of the true density f{x) (characteristic func- 
tion). The optimal kernel Kopt{x) is symmetric with respect to a; = 0, where it 
takes its maximum value. Note that \(f>{t)\ in Eq.(I3]) requires the knowledge of 
the true density, which is not available, hence Eq.Q cannot be used to com- 
pute the estimate in Eq.([2]) from the sample observations {Xj} alone. We show 
in the following how to circumvent thi s problem by a self-consistent approach. 
Eq.JS]) has been previously derived bv lWatson and Leadbetter C1963 ). and has 



been used for assessing the performance of spe cific kernels (|pavis. 1977 ) , as well 



as for constructing blockwise estimators (Efro movich. 20081 ). 

Although Eq.(|31) cannot be used to compute the density estimate, we make 
a step further and we write the Fourier transform (/)(i) of the density estimate 
f{x) in Eq.(I2]), using the transformed kernel as 

^it) = m^opdt) = j^^^^^. (4) 



where A(i) is the empirical characteristic function, i.e. 

N 

N 



1 ^ 
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Our approach is to construct an iterative procedure based on Eq.([3]), and to 
determine its exact fixed point. We replace the unknown term cj) in Eq.Q with 
an initial guess 0O: and we denote the resulting estimate as 01. Then, we try to 
obtain an improved estimate 02 by using a kernel which is optimal for (pi. By 
iterating this procedure, we construct the following sequence of estimates 

We search for a fixed point of the iteration, namely an estimate (psc for which 

(7) 



This coincides with the density whose corresponding optimal kernel applied to 
the data sample gives back the density itself. We call the resulting estimate 
a "self-consistent estimate". We derive in Appendix 2 the stable solution of 
Eq.©, which is equal to 



2{N -I) 




^_ 4(7V-1) 



7V2|A(i)|^ 



I Ait) (8) 



where / is the indicator function {lA{t) = 1 if i G A, /^(t) = if t ^ A) and 
A is the set of "accepted" frequencies, i.e. the frequencies giving a nonzero 
contribution to the estimate. In order for to be a stable solution of (O, the 
set A must be contained in i? (ACS), where t ^ B li and only if 

lAWP > (9) 

This condition sets a threshold for the amplitudes of frequencies t below which 
(pscit) = 0. Hence, the contribution of small amplitude waves is neglected, and 
this automatically determines the range of frequencies to be considered for the 
estimate. In most practical situations, the filter will cut the high frequency 
bands, but the filter is not constrained to be low-pass, and it can rather select 
different frequency bands. 

The condition A C B leaves the arbitrary choice of a subset of frequencies 
among those above the threshold set by condition ©. As demonstrated in Ap- 
pendix 3, the self-consistent estimate converges almost surely to the true density, 
provided that A is bounded, where the bound grows with N (in addition, the 
characteristic function is required to be integrable). In practical applications, a 
bounded interval must be necessarily implemented, and we selected 

A^ Bn[-~t*,t*] (10) 

where, as defined above, B — {t : |A(i)p > ^'"^^^^ }, and t* is set in such 
a way that the inequality © holds in one half of the interval [~t*,t*]. This 
choice is consistent with the sufficient conditions for asymptotic convergence 
(see Appendix 3) and has been shown to work well in applications (see Section 

3)- 

Since (psc and A are bounded, the self-consistent estimate in Fourier space, 
Eq.®, can be antitransformed back to the estimate in real space, i.e. 
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Figure 1: Amplitude gain of the self-consistent estimate (full line), G = 
|(/)sc|/|A|, as a function of the squared input amplitude |Ap and for different 
values of the sample size N . Dashed line shows the unstable solution (see Ap- 
pendix 2). The amplitude gain exists only if the inequality Q is satisfied, and 
is always smaller than 1, implying that the self-consistent estimate attenuates 
the amplitudes in the input. The exception is for |Ap = 1, for which G = 1, 
corresponding to the normalization condition. G tends to one for large values 
of N (while the unstable solution vanishes), implying that the estimate tends 
to reproduce all frequencies of the input in that case. 
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Figure 2: Illustrative example of an estimate of a Gaussian density from N — 200 
sample points. The true density is given by the full line. The dashed line is the 
self-consisten t estimate, compared with a histogram having optimal binwidth 
()Scott. 19791) . 



1 r+°° 

fscix) = — / e-''^cbscit)dt (11) 

Applications of the method are considered in the next section, here we describe 
its properties. A graphical illustration of the filtering properties of the estimate 
is given in FiglU where the amplitude gain G = \(j)sc\/\^\ is plotted. 

The self-consistent estimate is normalized, i.e. fsc{x)dx = 1 or, equiva- 

lently, 4'sciO) = 1, as a consequence of the normalization of the empirical density 
(A(0) = 1). Beside the zero frequency that is kept intact (corresponding to the 
normalization condition), the self-consistent estimate attenuates all other fre- 
quencies (|A| < 1 implies \(j)sc\ < |A|, see FigH]). Because (^sc(i) is continuous 
and infinitely differentiable at i = 0, all the moments of the self-consistent esti- 
mate fscix) exist. The mean and variance of fscix) are equal to (see Appendix 
2) 

1 ^ 
1 ^ 

Varix)^j;^Y.(^,-E{x))' (13) 

While the mean is equal to the sample mean, the variance is larger than the 
sample variance as well as than its unbiased estimator, which is normalized by 
instead of jfZjz- 

A drawback of the self-consistent estimate is that it is not guaranteed to 
be non- negative, while the true density is non- negative (note that |0sc(i)P < 1 
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holds from Eq.®, because |Ap < 1, but that is a necessary and not sufficient 
condition for fsc{x) to be non- negative) . However, we seldom observed negative 
values in simulations. Those can be corrected without any error cost by trans- 
lating the estimate downward un til the positive part is normalized to one, and 
setting to zero the negative part ( Glad et al.. 2003t Ushakov. 1999f) . In general, 
the restriction to a strictly non- negative estimate has a cost, in terms of the 
mean integrated square error E{I), quantified by the decay exponent a of the 
error as a function of the sample size, E{I) oc N~°'. Among the estimation 
procedures that properl y give non-negative results, hist ograms, MPL and Field 
Theories have a = 2/3 |Bialek et al.. 19961: IHoIv. 19871). which is also the limit 
of KDE when the density is discontinuous ("Usha kov. 19991) . For a continuous 
density, KDE improves to a = 4/5 ( Silverm an. 198^. Applications of estimates 
that are allowed to be negative reach better performance, like " m-th order" ker- 
nels (|Wand and Jones. 19951 iHall and Marron. 1987t iBerlinet. 19"93l) , that have 
a = (p rovided that th e density is m— 1-th differentiable) , while infinite 

order kernels (iDevrove. 19921 ). and the Sine kernel (|Davis. 1977tlSchmidt. 2000t 
iGlad et al.. 2007t) . yield a = 1 (for infinitely differentiable densities, besides 
logarithmic terms). Hence, releasing the requirement of a non- negative density 
estimate allows to improve the performance. The optimal scaling a = 1 is also 
reached by parametric estimators, such as Maximum Likelihood which are, how- 
ever, strictly non-negative. We present below numerical results suggesting that 
the self-consistent estimate (O also reaches a = 1 for infinitely differentiable 
densities. 



3 Applications to artificial data 

In this section we apply the self-consistent estimate to artificial data. The esti- 
mate fsc{x) is constructed from a sample of N data points {Xj} {j — 1 . . . N), 
by using Eas.(l5|).(l8llTT|). As an illustrative exemple, we present in Fig. [2] the 
application of the self-consistent estimate to a Gaussian sample of iV = 200 
points: with respect to a histogram (where we chose the optimal binwidth, see 
Scott (19791) ). it appears to be particularly accurate for assessing the density of 



the tails of the distribution. 

We test more systematically the performance of the self-consistent estimate 
on artif icial datasets generated fr om three distributions: Gaussian, Cauchy and 
Comb (jMarron and Wand. 19921 ). We compare the performance of the self- 



con sistent estimate (SC) with two kernel esti mators, Gau ss ian kernel (KG , 



see 



Silverman (19861) ) and Sine kernel (KS, see lOavis (19771 ): ISchmidt (2000l) : 



Gla,d et al. (20071) ). the latter being representative of non strictly positive esti 



mators. For the Cauchy distribution we also show the re s ults for a kerne l with 
a locally adaptive bandwidth (APT, see [Silverman (1986[ ): lHossier (19961) ). For 



each density f(x) and each estimator, we generate 100 samples, each composed 
of N points randomly and independently drawn from /(x). The performance 
is evaluated in terms of the mean integrated square error. Eg ([2^ . where the 
mean is calculated over the 100 sample realizations. We repeat the above pro- 
cedure for different values of N, ranging from lO'^ to 10^. In addition to the 
simulation results, we also present the theoretical bound given by the optimal 
kernel ([3]), i.e. the best an estimate of the type of Eq.® can achieve (OPT, 
see also Ea. ipS)) ). In the case of the Gaussian density we also show the the- 
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Figure 3: Mean integrated square error E{I) for the estimate of a Gaussian 
distribution (inset), as a function of the sample size N. Standard errors are 
about 5% and smaller than the size of symbols. Density estimates: Gaussian 
kernel (KG), Sine kernel (KS), self-consistent estimate (SC), each point is an 
average over 100 realizations of the sample. Theoretical bounds: optimal kernel 
(OPT), maximum likelihood (ML). SC is applied without any prior knowledge, 
while ML assumes that the density is Gaussian and OPT requires its power 
spectrum in advance. 
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oretical bound given by Maximum Likelihood (ML). The bandwidths for the 
Gaussian and Sine kernels are chosen from the sample data foUowing estabil- 
ished empirical rules: h = 0.79 * i q * N~^/^ for the Gaussian kernel {iq is the 
interquartile range, see Eq.(3.29) in lSilverman ('1986I) ). while for the Sine kern el 
/i = max{iJ : |A(l/ff + p)|2 < log(iV)/iV,V/9 G (0,log(Af))} Csee IPolitis ^2003h V 



First, we study the Gaussian distribution, /(x) = e~^'' /"^ / \/2t^ . Fig. [3]shows 
the mean integrated square error E{I) as a function of the sample size N , for 
the self-consistent estimate (SC), the Gaussian and Sine kernels (KG, KS), the 
optimal bound (OPT) and the Maximum Likelihood (ML). For the Gaussian 
kernel with a fixed bandwidth /i, the exact expression for the error is 



The average value of the bandwidth is h = 1.06 * N^^^^ , which gives an ap- 
proximate value of the error 

E{Ikg) - 0.33 * The error of the optimal 

estimate is given by Ea. (l28D . and for a Gaussian density is equal to 

, ^^7^Li.(l-jV)-l ^/i^ 
E{Iopt) = (15) 

where Li is the Polylogarithm function, defined by Lis(z) — X^fcLi fs"- The error 
for ML is equal to 

E{Iml) - T^N-^ (16) 

FigEl shows that the error of the SC estimate is consistently smaller than 
both kernel estimates, KG and K S. Both SC a nd KS approach the theoretical 



OPT scaling ~ N-^.J\og{N) fsee lOavis il^im . while ML scales ~ iV"!. We 



stress that both ML and OPT require prior knowledge of the density to be 
estimated. The former needs to know that the density is Gaussian, the latter 
needs its spectrum in advance, while the self-consistent method achieves the 
same scaling without any prior assumption. 

The second application is the estimate of a Cauchy distribution, f{x) — 
[7r(l-|-a;^)]~^. The interest of this case comes from the difficulties of binning long- 
tailed distributions, especially when the variance diverges. For the Gaussian 
kernel with a fixed bandwidth, the error is 

(17) 

The average bandwidth is h = 1.58 * iV-l/^ for which E{Ikg) - 0.55 * 
iV-4/5. The error of the optimal estimate. Eg. ((28)) . in the case of the Cauchy 
distribution, yields 

i.(W) = ^^"^^^^"^ . (18) 
^ " N -I 2ttN ^ ' 

In order to cope with long-tailed distributions, kernel methods have been 

generalized to the " adaptive" kernel, which allows the bandwidth to vary locally 
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10^ 10^ 10^ 10^ 10^ 

N 

Figure 4: Mean square error E{I) for the estimate of a Cauchy distribution 
(inset) as a function of the sample size N. Standard errors are about 5% and 
smaller than the size of symbols. Density estimates: Gaussian kernel (KG), Sine 
kernel (KS), adaptive kernel (APT), self-consistent estimate (SC), each point 
is an average over 100 realizations of the sample. Theoretical bound: optimal 
kernel (OPT). SC is applied without any prior knowledge, APT applies to long- 
tailed distributions, OPT requires the power spectrum of the true density in 
advance. 



according to a first estimate of the density ( Silverman. 1986t Hossier. 1996 ) . In 
Fig. 0] we show the mean square error as a function of A'' for SC, KG, KS 
estimates, the adaptive kernel estimate (APT), and the OPT bound. For small 
datasets, the adaptive kernel method performs best. However, the self-consistent 
method still shows better scaling with N, and its error is lower for large sample 
sizes, with a crossover occurring for N between 10'' and 10^. Again, the SC 
estimate performs better than both kernel estimates KG an d KS. Both S C and 
KS estimates approach the OPT scaling N''^ log{N) fsee lDavis Q977I )). We 
stress that the adaptive method requires some prior knowledge: it is used when 
one knows that the distribution is long-tailed and, again, the optimal estimate 
requires prior knowledge of its power spectrum. Conversely, we applied the 
self-consistent method blindly, in the same way as we did in the Gaussian cas e. 

The last application is the Comb distribution (jMarron and Wand fl992i )'). 
which is multimodal, where different modes have different widths, and its trans- 
form is affected by a large interval of frequencies. FiglS] shows the results for 
SC, KG, KS estimates and the OPT bound (which is computed numerically 
using Ea. (|28| ). Note that KG performs poorly in this case, because the em- 
pirical bandwidth is unable to capture the large interval of frequencies of the 
Comb distribution. Instead, both SC and KS embrace an appropriate interval 
of frequencies, estimated from the empirical characteristic function, they per- 
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form much better than KG and approach the OPT scahng. For large N, the 
SC estimate seems to perform shghtly better than KS. Its mean error tends 
also to decrease regularly with increasing dataset size, while the curve for the 
KS seems irregular (see e.g. the point at = 10'^). This phenomenon may 
be caused by the several maxima in the characteristic function of the comb 
density, each roughly corresponding to 1 /width of one of the six modes of the 
density. This generates local minima in the mean square error of the Sine kernel 
as a function of it s bandwidth, corresponding to an abrupt frequency cutoff (see 
Glad et al. (20071 )). The aforementioned irregular behavior may be due to a 
jump across two minima and could be absent in the SC estimate because of the 
absence of a bandwidth. 

Finally, we remark that the self-consistent estimate may perform poorly 
when applied to particular families of density functions. As explained in Ap- 
pendix 3, if the density is not square integrable or its characteristic function 
is not integrable, then the self-consistent estimate is not guaranteed to con- 
verge to the true density for large N. Simulations (not shown) suggest that the 
self-consistent estimate does not perform well when applied to the box function 
(whose characteristic function is not integrable) and to the distribution with 
one degree of freedom (which is not square integrable). 




2 3 4 "i fi 

10 10 10 10 10 

N 

Figure 5: Mean square error E{I) for the estimate of a Comb distribution (inset) 
as a function of the sample size N. Standard errors are about 1% and smaller 
than the size of symbols. Density estimates: Gaussian kernel (KG), Sine kernel 
(KS), self-consistent estimate (SC), each point is an average over 100 realizations 
of the sample. Theoretical bound: optimal kernel (OPT). SC is applied without 
any prior knowledge, OPT requires the power spectrum of the true density in 
advance. 
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4 Discussion 



We presented a method that estimates a density in a self-consistent way from a 
finite sample. This approach produces a unique estimate where no parameters 
have to be adjusted, and the only prior is the belief in the self-consistent pro- 
cedure. On the other hand, the cases in which there exists a widely accepted 
theoretical framework for the system at hand would rather point to a bayesian, 
parametric approach in which a specific form for the density is postulated. 

The self-consistent estimate converges to the true density for large N and it 
has a better performance and scaling than existing methods for all the examples 
we studied. Together with the simplicity of its implementation, these features 
make it preferable for applications, especially when large datasets are available. 
However, if the true density is not square integrable or its characteristic function 
is not integrable, the self-consistent estimate is not guaranteed to converge and 
it may perform poorly. 

A frequency filter emerges naturally in the derivation of the self-consistent 
estimate, as a function of the data sample, and prevents overfitting of the 
data. When long tails or power-law behavior is suspected, one is tempted 
to u se logarithmic binmng, but that has bee n shown to be highly inaccu- 
rate ( Clauset et al.. 2009t [Goldstein et al.. 2004 ). and sometimes leading to the 
conclusion that the density has power-law tails when it has not. The self- 
consistent method constitutes a good alternative when there are no solid theo- 
retical grounds to assume (or discard) a power law. 

For small values of N, adaptive bandwidth kernels may perform better than 
the self-consistent kernel (see FigH]). Note that if the bandwidth is allowed 
to vary locally, the performance of the estimate is not bounded by the opti- 
mal kernel performance, since it does not belong to estimates of the type of 
Eq.©, and can in principle perform better. Future studies will be devoted to 
extend the self-consistent approach to estimates in which the kernel may vary 
locally. For example, a generic non-adaptive estimate can be converted into an 
adaptive one by a transformation of x lea ding to a space with uniform measure 
()RuDDert and Kline. IQOl IPeriwal. 19971 ). 

The self-consistent method can be applied concretely to any problem in 
which a density is sampled and there is no prior knowledge of its functional 
shape. For instance, the mode of the instantaneous firing rate of a cortical 
neuron may indicate whether the cho ice of a behaving m onkey is triggered by 
a memorized object or a spatial cue ()Qlson et al.. 2000l) . Another application 
could be the densiy estimate of the spacings between zeros of the Riemann Zeta 
function, based on very large nume rical dataset, w hich could help investigate the 
related mathematical conjectures (lOdlv zko. 1987t) . Finally, the method can b e 
used to analyze samples obtained from Monte Carlo simulations (Bind er. 1986[) . 

We remark that the method could not be applied to the case of integer 
numbers. In general, when data points are uniformly spaced by multiples of 
any constant length, the Fourier transform of their density is periodic. An 
amplitude threshold and a cutoff frequency would be meaningless in that case, 
and this would preclude any practical application. However, when data points 
are integers there is usually no need for filtering, and it is sufficient to use a 
histogram, counting the occurrences of each number. 

Finally, a straightforward generalization is to consider a finite interval [a, b] 
instead of the entire line (— oo, -|-oo), by using Fourier series instead of Fourier 



12 



transforms. Another possibility is to apply the method to high- dimensional dis- 
tributions. In our derivation, the relevant variables are scalar quantities, and the 
d— dimensional analogy is obtained by using the c?— dimensional Fourier trans- 
form, i.e. by performing the integral -j^^pyc Jsj^d dt'^ instead of ^ Jsr dt. Among 
many possible implementations, the metho d could be then app lied to the anal- 
ysis of multie lectrode neuronal recor dings (jBrown et al.. 20o3 ) , multivariate fi- 
nancial data (IBreymann et al.. 20031). and re construction of Ramachandran an- 
gle distributions ( Klevwegt and Jones. 19961) . 



Appendix 1: Derivation of the optimal kernel 

In this section, we show that a unique "optimal" convolution kernel can be 
derived as a function of the power sp ectrum of the density to be es timated. A 
similar result has been presented in IWatson and Leadbetter Given a 

sample of N data points (real numbers), denoted by {Xj} (j = 1 . . . N), each 
independently drawn from a probability density distribution f{x), we write the 
estimate as 

1 ^ 

The true density / is assumed to be normalized, i.e. f{x)dx = 1. 

We look for a kernel K{x) such that the estimate minimizes the mean 
integrated square error 

E{I) - E / fix) - fix) dx (20) 



where i?(-) denotes an average over all the possible realizations of the data 
sample {Xj}. In order to minimize (120L w e follow a procedure for signal decon- 



volution (the Wiener filter (| Wiener. 19491 )'). We introduce the Fourier transform 



of the unknown density (characteristic function) 

/+00 
e''^fix)dx (21) 
-oo 

The normalization condition now reads 0(0) — 1. We also call (l>it) and K(t) 
the Fourier transforms of the estimate fix) and of the kernel Kix) respectively. 
The mean integrated square error (j20p corresponds to the mean square distance 
between the true density / and the estimate /, in terms of the Euclidean metric 
in the Hilbert space (we assume f,f,(l),(t) S L^). By means of Parseval's 
theorem, we rewrite ([20]) in Fourier space as 



1 



-\-oo 



EiD = TT^ / m - m dt (22) 

It is straightforward to perform the average in Fourier space. Applying the 
convolution theorem to Ea.([T9|. the transformed estimate is equal to (/)(<) = 
nit) Ait), where 



1 ^ 

^W = ]^E^^*''' (23) 



i=i 
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is the empirical characteristic function. Note that IS. ^ LP' for any finite value 
of N, but e by assumption. Using £'(A) = and £;(|A|2) = |<^|2 + 
N^^ (l — we can rewrite the error as 



E{I) = ^J ^ (1 - m + l^/'Pll - ^\'}dt. (24) 

Since f{x) is a density, it is normalized and non-negative, which implies |0p < 1. 
Then, the first term in the integral, proportional to is non-negative: it 

corresponds to the error due to the finite size of the sample, while the second 
term does not depend on TV. These two sources of er ror are known as error 



variance and error bias respectively (jSilverman. 19861 ). Among the possible 



choices of the kernel, we search for the one minimizing the mean integrated 
square error. Since Ea. (j24p is quadratic in k, it is straightforward to find its 
global minimum, by setting to zero the functional derivative of £'(/) with respect 
to K, i.e. 

2^^fv^ = N-'k (1 - - |</,|2(i -^) = Q (25) 

where the asterisk denotes complex cojugate. This yields a unique, optimal 
kernel, which in Fourier space reads 



optit) = . , (26) 



N 

N-iT\m\ 

The optimal kernel satisfies the normalization condition Kopt(O) = 1, because 
0(0) — 1, and is a real function. Since the density / is real, then \(t){t)\ = \(t){—t)\, 
which implies that Kopt{t) is an even function. Then its antitransform Kopt{x), 
the optimal kernel in the real space, is also real and even and, because expression 
(|26p is non- negative, Kopt{x) takes the maximum value at a; = 0, i.e. at the 
coordinate of each data point in Ea. (fT9| . 

Using the expression for the transformed optimal kernel, Eq. (1261) , we rewrite 
the estimate, Eg. p^ . in Fourier space, as 

which we call the " optimal estimate" . The optimal estimate satisfies the normal- 
ization condition, (/>(0) = 1, because A(0) = 1 and 0(0) = 1. For infinite sample 
size {N — oo), the optimal estimate reduces to the true density with probabil- 



ity one, i.e. M t) 4'{t), because A{t) — (f){t) (see lCsorgo and Totik. Q983D . 

m 



Ushakov (19991 ) for the detailed sufficient conditions) , while the fractional term. 



the kernel, tends to one. This is because an infinite sample would reproduce 
the true density itself, without the need of any transformation of the data. For 
finite N, the optimal estimate cuts the frequencies that have less power in the 
true density, and hence are more subject to noise, i.e. frequencies t whose power 
is of the order |0(t)P ~ 1/A^ or less. 

The above proced ure is analogou s to the derivation of the Wiener filter for 
signal deconvolution ( Wiener. 19491) : the prior knowledge of the power spec- 



trum of both the signal, the unknown density, and the noise establishes a 
unique criterion for optimal signal to noise separation (note that, once the sig- 
nal spectrum is given, |i?(A)p = the assumption of independency of data 
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points allows the noise spectrum to be written as a function of the signal, i.e. 

E{\A\')-mA)\' = N-Hi-m).^ 

We conclude this section by deriving the minimum square error obtained by 
the application of the optimal kernel, that will be useful for assessing the perfor- 
mance of practical applications of the method. By substituting the expression 
(j26p of the optimal kernel in (j24p . the associated minimum square error can be 
written, after some algebra, as 

minx E{1) = jV-1 ^^^^ 

where we made explicit the dependence of the optimal kernel on the sample size 

N, by writing Kopt = K^opti^)- 

Finally, note that < 1 and |A(t)p < 1, which implies from Eg. ip?)) 

that also the optimal estimate satisfies |(/>(i)P < 1. While this is a necessary 
condition for the antitransform f(x) to be non-negative, it is not sufficient, and 
f{x) is not guaranteed to be a non- negative density. 

Appendix 2: Derivation of the self-consistent es- 
timate 

In this section we derive the expression for the self-consistent estimate (j)scj 
Eq.®, and we study its stability. We start from Eq.®, the iterative map that 
we rewrite here 



: ATA 



We search for a fixed point of the iteration, namely (psc such that 

, iVA 



(29) 



(30) 



7V-l+|0,,|-2 

We derive in the following the two solutions (beyond the null solution) of Eq. ((30)) 
and we show that only one solution is stable with respect to the iteration, 
Eg. ip^ . We start by taking the absolute value of Eq. ([50)) . in order to obtain an 
equation for the single unkown variable \4>sc\ (note that (j)sc is complex valued). 
Then, wc multiply the expression by the denominator and by \(l)sc\ (leaving the 
null solution (j)sc — 0), obtaining a simple quadratic equation 

iN~l)\^scf + l = N\A\\4,,,\ (31) 

Provided that |Ap > this equation has the following two solutions, 

denoted by the superscript ± 



2{N- 1) 



/ 4(iV-l) 
V iV2|AP 



(32) 



This solution gives the absolute value of (j)^ . By replacing this expression back 
into the right hand side of Eq. ([50)) . we obtain the solution for 0*, given by 



15 



2(iV- 1) 



1± 



4(A^- 1) 
7V2|A|2 



(33) 



While (j)'^ is normahzed, (j) is not, i.e. when t = 0, A(0) = 1 imphes (f>~^{0) = 1, 
while (f>~ {0) ~ jfzj- The two solutions are of very different magnitudes; for 
large N the solution vanishes (|(/)~| ~ ivjA|)' '^hile 0+ stays finite (|(?!)^| — 
|A|). For all values of |A| for which the solutions 4>^ exist, the following relation 
holds: \4>+\\4>-\ = T^- 

We show that 0+ is a stable solution of the iteration, while 4>~ is unstable. 
This can be seen by taking the absolute value of Ea. (p9)) and computing the 
derivative 



d\(f>n\ 



It 



I0.NI0±I 



i{N- 1) 
7V2|A|2 



(34) 



This implies that, provided that the two solutions exist, i.e. provided that 
I A|2 > ^^^7^'' , then 0+ is stable (derivative smaller than one) and 0^ is unstable 

(derivative larger than one). When |A|2 = ^^^i-^^, the two solutions annihilates 

-1) 



in a saddle node bifurcation. For lAP < 



Ar2 



only the null solution, 0sc=O, 



is available. That is always stable, as can be checked by computing 



lim 



= 



> '"''j^-i " 7 ttie iteration 



reaches 



(35) 
for 



\4>,^Ho d\(t)n\ 
In summary, when |A|2 > ^^'^^^^ 

\^o\ < I'P^l and <psc = 4>+ for |0o| > l-^"' "'^^^'^ '^'^ ^^^^ 
globally stable solution is (j)sc ~ 0. As described in the main text, we define the 
set B oi t values as 



, while for |A|2 < '^'''^2 ^' , the unique. 



B^\t:\A{t)\'> 



2^ 4(iV_l) 



7V2 



(36) 



Hence, (j)sc{t) = when t ^ B. However, t E B does not guarantee that 
(f>sc{t) = (f>'^{t), since a small initial guess |(/>o(OI < would determine 

(l>sc{t) — 0. Then, we define A as the set of t values for which |0o(OI — 
and hence 4'sc{t) = 4>'^{t). The arbitrary choice of the initial guess translates 
into the arbitrary choice of the set A, provided that A C B, a.s required by the 
existence of the nonzero solutions. This is summarized in Eq.®, and concludes 
the derivation. 

We conclude this section by calculating the mean and variance of the self- 
consistent estimate fsc, Eg. pTj) . In a finite neighborhood of t — 0, 4>sc{t) — 
(l>'^{t), which is continuous and infinitely diff'erentiable at i = 0. Then, the 
mean and variance can be computed by the derivatives of cf)^ and at 
t = 0, namely 



Eix) 



dt 



t=o 



dA 



A=l 



.dA 

''~dt 



(37) 



16 



2 di2 



d\A\ 




(38) 



|AP = 1 

where we used the chain rule and the fact that |Ap is an even function, hence 

'^^dt \t =o ~ ^ ' straightforward to show that the terms in round brackets 

in Eas. (|37l38p are, respectively, equal to the sample mean and sample variance. 
Because |A| is even, then ^^^|^_]^ = 1 and, by differentiating Ea. (|3T|) . we obtain 

A+|2| 



^pJfI|A|2=i = ^/(^ ~ 2). Finally, Eas. (|37l38p can be rewritten as 

1 ^ 



1 ^ 

y^r{x) = j^Y.(^,-E{x)r (40) 



Appendix 3: Asymptotic convergence of fsc 

In this section we investigate the asymptotic (large N) behavior of the self- 
consistent estimate. We study the sufficient conditions for the estimate fsc to 
converge to the true density / for N ^ oo. In particular, we prove the following 

Theorem 

If the true density f{x) is square integrable and its transform is integrable, then 
the self-consistent density estimate fsc{x), defined by Eas.(l5|).(l51 ITT|) . converges 
almost surely to the true density for large N ^ under the additional assumptions 

lim r = oo (41) 

N^oo 

lim ^ = (42) 

Proof 

Because the true density f{x) and the self-consistent estimate fsc{x) are both 
square integrable, we can express them as Fourier transform of, respectively, 
<j){t) and (j)scit). By assumption, the characteristic function is integrable, i.e. 

\(t){t)\dt < (X) (43) 

In the following sequence of inequalities, we find an upper bound for the 
difference between the true density and its estimate, and we use the fact that 
4'sc{i) = for \t\ > t* . In order to prove the theorem, we show that the upper 
bound tends to zero for large N. 
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fscix) - f{x) 



27r 



+ CXD 



< 



27r 



+ 00 



27r 



+ 00 



4>sc{t) ~ 4>{t) 



dt < 

dt = 



2ti 



(t>sc{t) - 



27r 



4>scit) ~ A(t) + A{t) - (j){t) 



ZTT 



|t|>f 



|t|>f 



< 



27r 



<^,c(t) - A(t) dt+^ f \A{t) - (f>{t)\ dt+^ 

2,71 I ZTT 



m)\dt = 
10(01 



iti>t* 



where A(i) is the empirical characteristic function, see Eq.(IS])- For increasing 
N, t* increases following the limiting bounds given by Eas. (l41l42p . Then, the 
second integral in the last expression tends to zero because of theorem 1 in ref. 
( Csorgo and Totik.. 19831) , while the third integral tends to zero because of the 
integrability of the characteristic function (j){t), EQ. p5)) . In order to prove the 
theorem, we have to demonstrate that also the first integral tends to zero for 
large N. We use the expression of (psc, Eq.®, and we denote by A+ and A~ the 
set of values of t for which |A(t)p is, respectively, above or below the threshold 
set by Eq.®, i.e. |A(OP > 4{N-1)/N^ or |A(OP < 4:{N-l)/N^. Then, the 
first integral is equal to 



where the term in curly brakets is non-negative. The first term can be expanded 
by means of the inequality \/l ~ x > 1 — ^/x for x G (0, 1), while in the second 
term we substitute the integrand with its maximum value in the interval, and 
we use the fact that the length of the interval is smaller than 2t* . Then, the 
above expression is smaller than 



1 

'2^ 



4{N - 1) 
iV2|A(t)|2 



i-f ,f)nA- 



dt 



|A(t)| dt 



1 

< — 

- 2tt 



< 



1 

2^ 



(-t*,t*)nA+ 



(-t*,t*)nA+ 



1 



|A(t)| 
N - 1 

|A(t)| 

y/Wn:^ N-1 



1 



< 



t* 



dt 



dt 



1 



1 



2t*VN^ 
5dV 



2t*y/N - 1 



< 



< 



2VN^ 
N 



where we used |A(t)| < 1 in the last inequality. Because of Ea. (|^^ . the last 
expression tends to zero for large N, thus proving the theorem. 
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Comments 



The above theorem assumes that the true density is square integrable and its 
transform integrable. We expect the self-consistent estimate to perform poorly 
when applied to densities that do not meet those criteria, and its asymptotic 
consistency is not guaranteed. Examples of such distributions include the box 
function and the distribution with one degree of freedom. 

The condition ([^^ gives the limit on the frequency bound t* , setting the 
maximum increase of t* with N. In simulations, the magnitude of the bound 
t* depends on the threshold of the empirical characteristic function, Eq.®, 
in such a way that its amplitude is larger than the threshold in half of the 
interval {—t*,t*). Now we argue that this choice satisfies the assumptions of 
the theorem, without giving a formal derivation. 

Since the characteristic function is integrable, Ea.(P5|. then 

lim tm)\ = (44) 

\t\~^oo 

We approximate the empirical characteristic function A(t) with the true char- 
acteristic function (j){t) , which is reasona ble provided that N is lar ge and t does 



not increases exponentially with TV fsee lCsorgo and Totik. nQSGI) !. Then, the 



threshold condition © becomes > 4(A^ — l)/iV^. Hence, t* increases 

in such a way that, for large N, the order of magnitude of the characteristic 
fimction is ~ 1/VN. Substituting this expression in Eg. di^ we get 

lim t*m*)\^ \un (45) 



which correspond to condition (|42p . On the other hand, since the true cumula- 
tive distribution is continuous, then \4>{t)\ > for any finite value of <, and this 
imples that t* tends to infinity for large N, as required by the condition (PT|) . 

In conclusion, the above theorem gives the sufficient conditions for the 
asymptotic consistency of the estimate, especially concerning the finite inter- 
val of frequencies set by the bound t* . From the above arguments, we expect 
the recipe for t* used in simulations to guarantee the asymptotic convergence 
of the estimate. 
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